Kernel Benchmarks and Metrics for Polymorphous Computer Architectures
Polymorphous computer architectures (PCA) are new computer architectures being
developed under a DARPA/IPTO program to support mission agility for future
high-performance DoD embedded applications. These architectures will be able
to “morph” into different modes of execution, with the goal of delivering uniform, high
performance across a wide variety of processing types and workload
compositions. Examples include the MIT RAW machine [5], the
Stanford Smart Memories project [3], and the University of Texas TRIPS machine [4].

To evaluate the applicability of PCA to next-generation ISR (intelligence, surveillance,
reconnaissance) applications, MIT Lincoln Laboratory has developed example
applications and kernel benchmarks that span the space of embedded ISR application
requirements. Matlab code for an example ISR application, with elements of
feature-aided tracking [6], is being analyzed by teams developing PCA architectures. In addition,
seven kernel benchmarks that represent important pieces of this application have been
defined. These seven kernels are FIR filter, singular value decomposition, constant
false-alarm rate (CFAR) detection, corner turn, pattern matching, graph optimization via
genetic algorithm, and database search.

An important first step in evaluating PCA architectures is the implementation of these
kernel benchmarks on processors used in modern embedded applications. This
implementation provides a baseline for future comparisons. MIT/LL has implemented
these seven kernels on the PowerPC G4 processor. The results show that the throughput
varies considerably from kernel to kernel. This variation in performance is reflected in a
metric known as stability. Defined by Kuck [2], stability is the ratio of the minimum to
the maximum throughput for a particular set of problems. A chief benefit of PCA
architectures is expected to be their stable performance across a range of kernels and data
sizes.
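
The stability metric described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark code: the function names are hypothetical, the sample throughputs are made-up numbers, and taking the product metric as mean throughput times stability is an assumption (the source does not say which throughput enters the product).

```python
def stability(throughputs):
    """Stability per Kuck's definition: minimum / maximum throughput."""
    values = list(throughputs)
    if not values or min(values) < 0 or max(values) <= 0:
        raise ValueError("need non-negative throughputs with a positive maximum")
    return min(values) / max(values)


def throughput_stability_product(throughputs):
    """Assumed form of the product metric: mean throughput * stability."""
    values = list(throughputs)
    mean_throughput = sum(values) / len(values)
    return mean_throughput * stability(values)


# Hypothetical throughputs (Gflop/s) for one kernel at three data sizes.
samples = [0.5, 1.0, 2.0]
print(stability(samples))
print(throughput_stability_product(samples))
```

A stability close to 1 means the throughput barely changes across the problem set; a value close to 0 means at least one problem runs far slower than the best case.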

Hoffmann [1] has implemented convolution and many other kernels on the RAW
simulator using scalable systolic algorithms. Hoffmann’s throughput results for real
convolution on a simulated 250 MHz RAW are shown in Figure 1 and compared with a
similar kernel on a 500 MHz G4. Both machines have a peak throughput rated at 4
Gflop/s. The simulation results show that RAW has the potential to perform
much better than the G4 on this kernel.

In  this  talk,  we  present  and  analyze  performance  results  for  several  PCA  kernels  on  the 
MIT  RAW  simulator  and  on  a  RAW  test  board.  We  compare  these  with  the  baseline 
performance  results  obtained  on  the  PowerPC  G4  in  terms  of  throughput,  stability, 
efficiency  and  power  efficiency. 
MIT-LL Surveyed DoD Applications to Provide:

•  Kernel  Benchmark  Definitions 

•  Example  Requirements  and  Data  Sets 
For a given application, PCA
processors should achieve a higher
product of throughput and stability
than conventional processors
Characteristics:

•  Rigid memory hierarchy

•  Rigid datapath

•  Specialized structures

High Performance Programming:

•  Change algorithm to match
memory hierarchy

•  One degree of freedom

•  Can only work with blocking factor


Characteristics: 

•  Flexible memory hierarchy

•  Flexible datapath(s)

•  Generic structures

High Performance Programming:

•  Co-optimize algorithm and
architecture

•  Many degrees of freedom

•  Optimize time/space tradeoff


PCA provides more degrees of freedom, and thus greater flexibility
(morphability) and greater performance over a range of applications
PowerPC G4 7410 Specs

-  500 MHz clock rate

-  4 Gflop/s peak

-  125 MHz main memory bus

-  L1 cache: 32 kB, on chip

-  L2 cache: 2 MB, 250 MHz bus

-  Mercury  daughtercard 


•  Two predictors of kernel performance:

-  Programmer’s maximization of data reuse and locality (blocking factor)

-  Memory hierarchy of G4

•  Blocking factor determines max achieved performance

•  Memory hierarchy determines shape of performance curve

•  Want to maximize blocking factor to limit memory hierarchy
bottleneck
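
As a rough illustration of what a blocking factor means in practice, the sketch below performs a corner turn (matrix transpose) tile by tile, so that each tile of the source and destination is touched together. It is a hypothetical example rather than the benchmark implementation: `corner_turn_blocked` and its `block` parameter are names introduced here, and pure-Python lists will not reproduce the cache behavior of the G4; only the loop structure is the point.

```python
def corner_turn_blocked(matrix, block):
    """Transpose a list-of-rows matrix in block x block tiles.

    `block` plays the role of the blocking factor: a tile of the source
    and the matching tile of the destination are processed together.
    """
    if block < 1:
        raise ValueError("block must be at least 1")
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    out = [[None] * rows for _ in range(cols)]
    for i0 in range(0, rows, block):          # tile origin, source rows
        for j0 in range(0, cols, block):      # tile origin, source columns
            for i in range(i0, min(i0 + block, rows)):
                for j in range(j0, min(j0 + block, cols)):
                    out[j][i] = matrix[i][j]
    return out


m = [[1, 2, 3],
     [4, 5, 6]]
print(corner_turn_blocked(m, 2))
```

On a cached processor the tile size would be chosen so that a source tile and a destination tile fit in cache at the same time, which is the sense in which the blocking factor bounds achievable performance.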
PowerPC G4 (Mercury)

•  500 MHz

•  Peak: 4 Gflop/s


Mean  Efficiency:  29% 


*Implemented with VSIPL real FIR filter


Caches are performance bottlenecks

-  Performance  curve  changes  when  cache  is  full 

-  Product  metric  penalizes  G4  for  performance 
drop  at  cache  boundaries 
Efficiency: $E(N, R) = \frac{C(N)}{T(N, R)\,\bigl(P(R) + M(R)\bigr)}$

where

N = problem size
R = edge length of tile array
C(N) = number of operations
T(N,R) = number of time steps
P(R) + M(R) = total number of processors

Compute Efficiency Condition: $\lim_{\sigma, R \to \infty} E(\sigma, R) = 1$

where $\sigma = N/R$


Stream algorithms achieve high efficiency by
optimizing the time/space tradeoff: tailoring the
memory hierarchy and datapaths to the specific
needs of the application
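
The compute efficiency condition can be checked numerically with a toy cost model. The sketch below reads the definitions above as efficiency = operations / (time steps × total processors). The specific forms of C, T, P and M are invented here purely for illustration (they are not taken from the slides or from any measured kernel); they are chosen only so that the limit behavior is easy to see.

```python
def compute_efficiency(ops, time_steps, compute_procs, memory_procs):
    """E = C / (T * (P + M)), as read from the definitions above."""
    return ops / (time_steps * (compute_procs + memory_procs))


def toy_efficiency(sigma, r):
    """Efficiency under a made-up cost model with N = sigma * R.

    Assumed (hypothetical) model:
      C(N)   = N**3              operations
      T(N,R) = N**3 / R**2 + 3*R time steps (ideal work + fill/drain)
      P(R)   = R**2              compute tiles
      M(R)   = 2*R               memory tiles on the array edge
    """
    n = sigma * r
    ops = n ** 3
    time_steps = n ** 3 / r ** 2 + 3 * r
    return compute_efficiency(ops, time_steps, r ** 2, 2 * r)


for sigma, r in [(1, 2), (4, 8), (16, 64), (100, 1000)]:
    print(sigma, r, toy_efficiency(sigma, r))
```

Under this toy model the efficiency factors as [σ³/(σ³+3)]·[R/(R+2)], so it approaches 1 only when both σ and R grow, which is the shape of the condition stated on the slide.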
Stream algorithms achieve high performance by removing the
memory access bottleneck from the computational critical path
Raw implements the appropriate memory hierarchy for the problem
Raw’s Throughput × Stability score stays high
*  SVD is becoming more widely used in signal and image
processing

-  Important for spectral analysis

-  Can also be used for adaptive beamforming, especially for
ill-conditioned problems

*  SVD  kernel  implementation  is  a  Reduced  SVD  that  begins 
with  a  QR  factorization  if  M  >  N 


-  Uses  Modified  Gram-Schmidt  QR  factorization 

-  Many  possible  optimizations,  especially  block  factorization 
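
For readers unfamiliar with the factorization named above, here is a minimal, unblocked Modified Gram-Schmidt QR sketch in pure Python for complex matrices. It is a textbook version written for illustration, not the kernel implementation benchmarked here; the function name `mgs_qr` and the list-of-rows layout are assumptions, and none of the block optimizations mentioned above are applied.

```python
import math


def mgs_qr(a):
    """Modified Gram-Schmidt QR of an m x n matrix (m >= n, full rank).

    `a` is a list of rows. Returns (q, r): q is m x n with orthonormal
    columns, r is n x n upper triangular, and a = q * r.
    """
    m, n = len(a), len(a[0])
    # Work column by column; v[j] starts as column j of a.
    v = [[complex(a[i][j]) for i in range(m)] for j in range(n)]
    q_cols = []
    r = [[0j] * n for _ in range(n)]
    for k in range(n):
        norm = math.sqrt(sum(abs(x) ** 2 for x in v[k]))
        if norm == 0.0:
            raise ValueError("matrix is rank deficient")
        r[k][k] = complex(norm)
        qk = [x / norm for x in v[k]]
        q_cols.append(qk)
        # "Modified": immediately remove the q_k component from every
        # remaining column, using the already-updated columns.
        for j in range(k + 1, n):
            r[k][j] = sum(qc.conjugate() * x for qc, x in zip(qk, v[j]))
            v[j] = [x - r[k][j] * qc for x, qc in zip(v[j], qk)]
    q = [[q_cols[j][i] for j in range(n)] for i in range(m)]
    return q, r


a = [[1 + 1j, 2], [0, 1j], [1, 1]]
q, r = mgs_qr(a)
print(r)
```

The inner update loop streams over whole columns, which is consistent with the observation that column size relative to the L1 cache drives inner-loop performance on a cached processor.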
PowerPC G4 (Mercury)

•  500 MHz

•  Peak: 4 Gflop/s
Modified Gram-Schmidt QR factorization of a
16-column complex matrix

MGS is about 60% of SVD time

L1 cache drives inner loop performance

-  1: A+R fills L1 cache

-  2: One column of A is half of L1 cache
-  Maximizes time/space efficiency
-  Fast Givens approach for QR/LQ
The QR performance demonstrates the benefit of the PCA
approach on matrix algebra operations
RAW Test Board
(October 2003)

•  2  MB  DRAM 

•  High  Speed  I/O 
•  High Speed A/D
Conclusions


*  MIT Lincoln Laboratory has defined kernel benchmarks for
the PCA program

-  Multiple  categories  of  processing 

-  Based  on  DoD  application  needs 

*  Establishing  a  performance  baseline  on  conventional 
architectures 

-  Performance  is  limited  by  the  blocking  factor  and  by  the 
memory  hierarchy 

-  Example: CFAR has low ops/byte and 3% efficiency; FIR has high
ops/byte and 29% efficiency


*  PCA  processors  allow  opportunities  for  high  performance 

-  Performance  achieved  through  co-optimization  of  the 
algorithm  and  the  architecture 

-  Example:  unusual  SVD  algorithm  leads  to  high  performance 
on  Raw 

-  The  greater  degree  of  freedom  allows  greater  optimization 
across  a  variety  of  problem  domains 